{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
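    "# Preface\n",
    "\n",
    "A regular expression is a special sequence of characters that makes it easy to check whether a string matches a given pattern.\n",
    "\n",
    "Python has included the `re` module since version 1.5; it provides Perl-style regular expression patterns.\n",
    "\n",
    "The `re` module gives Python full regular expression support.\n",
    "\n",
    "The `compile` function builds a regular expression object from a pattern string and optional flags. That object has a set of methods for matching and substitution.\n",
    "\n",
    "The `re` module also provides module-level functions equivalent to these methods; they take the pattern string as their first argument.\n",
    "\n",
    "This chapter covers the regular expression functions most commonly used in Python.\n",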
    "\n",
    "# References\n",
    "\n",
    "Link: https://www.runoob.com/python/python-reg-expressions.html\n",
    "\n",
    "---\n",
    "\n",
    "## The `re.match` function\n",
    "`re.match` tries to match a pattern at the start of the string. If the pattern does not match at the start, `match()` returns `None`.\n",
    "\n",
    "Function syntax:\n",
    "```\n",
    "re.match(pattern, string, flags=0)\n",
    "```\n",
    "\n",
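    "Parameters:\n",
    "\n",
    "|Parameter|Description|\n",
    "|----|----|\n",
    "|pattern|The regular expression to match.|\n",
    "|string|The string to match against.|\n",
    "|flags|Flags that control how the pattern is matched, such as case sensitivity or multi-line matching.|\n",
    "\n",
    "On success, `re.match` returns a match object; otherwise it returns `None`.\n",
    "\n",
    "Use the match object's `group(num)` or `groups()` method to retrieve the matched text.\n",
    "\n",
    "|Match object method|Description|\n",
    "|----|----|\n",
    "|group(num=0)|Returns the string matched by the whole expression. `group()` also accepts several group numbers at once, in which case it returns a tuple of the corresponding group values.|\n",
    "|groups()|Returns a tuple containing the strings of all subgroups, from group 1 up to the last group.|\n",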
    "\n",
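    "The table notes that `group()` accepts several group numbers and that `groups()` returns every subgroup. A minimal sketch of both calls, reusing the sample sentence from this notebook (the variable names `m` and `pair` are chosen here for illustration):\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "m = re.match(r'(.*) are (.*?) .*', 'Cats are smarter than dogs')\n",
    "\n",
    "# Several group numbers at once: the result is a tuple\n",
    "pair = m.group(1, 2)\n",
    "print(pair)\n",
    "\n",
    "# groups() returns all subgroups, starting from group 1\n",
    "print(m.groups())\n",
    "```\n",
    "\n",
    "Both calls should give the same tuple here, because the pattern has exactly two groups.\n",
    "\n",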
    "## Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0, 3)\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "print(re.match('www', 'www.runoob.com').span())  # matches at the start\n",
    "print(re.match('com', 'www.runoob.com'))         # does not match at the start\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "matchObj.group() :  Cats are smarter than dogs\n",
      "matchObj.group(1) :  Cats\n",
      "matchObj.group(2) :  smarter\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "line = \"Cats are smarter than dogs\"\n",
    "\n",
    "# (.*) is greedy and captures \"Cats\"; (.*?) is lazy and captures only \"smarter\"\n",
    "matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)\n",
    "\n",
    "if matchObj:\n",
    "    print(\"matchObj.group() : \", matchObj.group())\n",
    "    print(\"matchObj.group(1) : \", matchObj.group(1))\n",
    "    print(\"matchObj.group(2) : \", matchObj.group(2))\n",
    "else:\n",
    "    print(\"No match!!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The `re.search` function\n",
    "`re.search` scans the whole string and returns the first successful match.\n",
    "\n",
    "Function syntax:\n",
    "```\n",
    "re.search(pattern, string, flags=0)\n",
    "```\n",
    "\n",
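    "Parameters:\n",
    "\n",
    "|Parameter|Description|\n",
    "|----|----|\n",
    "|pattern|The regular expression to match.|\n",
    "|string|The string to search.|\n",
    "|flags|Flags that control how the pattern is matched, such as case sensitivity or multi-line matching.|\n",
    "\n",
    "On success, `re.search` returns a match object; otherwise it returns `None`.\n",
    "\n",
    "Use the match object's `group(num)` or `groups()` method to retrieve the matched text.\n",
    "\n",
    "|Match object method|Description|\n",
    "|----|----|\n",
    "|group(num=0)|Returns the string matched by the whole expression. `group()` also accepts several group numbers at once, in which case it returns a tuple of the corresponding group values.|\n",
    "|groups()|Returns a tuple containing the strings of all subgroups, from group 1 up to the last group.|\n",
    "\n",
    "## Example"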
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0, 3)\n",
      "(11, 14)\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "print(re.search('www', 'www.runoob.com').span())  # matches at the start\n",
    "print(re.search('com', 'www.runoob.com').span())  # matches away from the start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "searchObj.group() :  Cats are smarter than dogs\n",
      "searchObj.group(1) :  Cats\n",
      "searchObj.group(2) :  smarter\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "line = \"Cats are smarter than dogs\"\n",
    "\n",
    "# Same pattern as in the re.match example; search scans the whole string\n",
    "searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)\n",
    "\n",
    "if searchObj:\n",
    "    print(\"searchObj.group() : \", searchObj.group())\n",
    "    print(\"searchObj.group(1) : \", searchObj.group(1))\n",
    "    print(\"searchObj.group(2) : \", searchObj.group(2))\n",
    "else:\n",
    "    print(\"Nothing found!!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
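    "## Difference between `re.match` and `re.search`\n",
    "`re.match` only matches at the beginning of the string: if the beginning does not fit the regular expression, the match fails and the function returns `None`. `re.search` scans the whole string until it finds a match.\n",
    "\n",
    "## Example"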
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No match!!\n",
      "search --> searchObj.group() :  dogs\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "line = \"Cats are smarter than dogs\"\n",
    "\n",
    "# match only looks at the start of the string, so 'dogs' is not found\n",
    "matchObj = re.match(r'dogs', line, re.M | re.I)\n",
    "if matchObj:\n",
    "    print(\"match --> matchObj.group() : \", matchObj.group())\n",
    "else:\n",
    "    print(\"No match!!\")\n",
    "\n",
    "# search scans the whole string and finds 'dogs' at the end\n",
    "searchObj = re.search(r'dogs', line, re.M | re.I)\n",
    "if searchObj:\n",
    "    print(\"search --> searchObj.group() : \", searchObj.group())\n",
    "else:\n",
    "    print(\"No match!!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search and replace\n",
    "Python's `re` module provides `re.sub` for replacing matches in a string.\n",
    "\n",
    "Syntax:\n",
    "```\n",
    "re.sub(pattern, repl, string, count=0, flags=0)\n",
    "```\n",
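    "Parameters:\n",
    "\n",
    "+ `pattern` : the regular expression pattern string.\n",
    "+ `repl` : the replacement string; it may also be a function.\n",
    "+ `string` : the original string to search and replace in.\n",
    "+ `count` : the maximum number of replacements to make; the default 0 replaces all matches.\n",
    "+ `flags` : flags that control how the pattern is matched.\n",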
    "\n",
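    "To illustrate `count`, here is a small sketch that limits the number of replacements; the sample string and the variable names are made up for this illustration:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = '2004-959-559'\n",
    "\n",
    "# count=0 (the default) replaces every match\n",
    "all_replaced = re.sub(r'-', '', text)\n",
    "\n",
    "# count=1 replaces only the first match\n",
    "first_only = re.sub(r'-', '', text, count=1)\n",
    "\n",
    "print(all_replaced)\n",
    "print(first_only)\n",
    "```\n",
    "\n",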
    "## Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Phone number:  2004-959-559 \n",
      "Phone number:  2004959559\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "phone = \"2004-959-559 # This is a foreign phone number\"\n",
    "\n",
    "# Remove the Python-style comment from the string\n",
    "num = re.sub(r'#.*$', \"\", phone)\n",
    "print(\"Phone number: \", num)\n",
    "\n",
    "# Remove every non-digit character\n",
    "num = re.sub(r'\\D', \"\", phone)\n",
    "print(\"Phone number: \", num)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The `repl` argument as a function\n",
    "The following example multiplies each matched number in the string by 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A46G8HFD1134\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# Multiply the matched number by 2\n",
    "def double(matched):\n",
    "    value = int(matched.group('value'))\n",
    "    return str(value * 2)\n",
    "\n",
    "s = 'A23G4HFD567'\n",
    "# Raw string, so \\d is not treated as an invalid string escape\n",
    "print(re.sub(r'(?P<value>\\d+)', double, s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The `re.compile` function\n",
    "The `compile` function compiles a regular expression into a pattern (`Pattern`) object, whose `match()` and `search()` methods can then be used.\n",
    "\n",
    "Syntax:\n",
    "```\n",
    "re.compile(pattern[, flags])\n",
    "```\n",
    "\n",
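    "Parameters:\n",
    "\n",
    "+ `pattern` : a regular expression in string form\n",
    "\n",
    "+ `flags` : optional; the matching mode, such as ignoring case or multi-line mode. The available flags are:\n",
    "\n",
    "    + `re.I` ignores case\n",
    "    + `re.L` makes `\\w`, `\\W`, `\\b`, `\\B` depend on the current locale (in Python 3 it is only allowed with bytes patterns)\n",
    "    + `re.M` multi-line mode: `^` and `$` also match at the start and end of each line\n",
    "    + `re.S` makes `.` match any character, including a newline (by default `.` does not match a newline)\n",
    "    + `re.U` makes `\\w`, `\\W`, `\\b`, `\\B`, `\\s`, `\\S` follow the Unicode character properties database (in Python 3 this is already the default for `str` patterns)\n",
    "    + `re.X` improves readability by ignoring whitespace in the pattern and comments after `#`\n",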
    "    \n",
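    "A brief sketch of three of the flags listed above, `re.I`, `re.S` and `re.X`; the patterns, sample strings and variable names are invented for illustration:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# re.I: case-insensitive match\n",
    "ci = re.compile(r'cats', re.I).match('Cats are smarter')\n",
    "\n",
    "# Without re.S, '.' stops at a newline; with re.S it crosses it\n",
    "plain = re.search(r'a.b', 'a\\nb')\n",
    "dotall = re.search(r'a.b', 'a\\nb', re.S)\n",
    "\n",
    "# re.X: whitespace and # comments inside the pattern are ignored\n",
    "verbose = re.compile(r'''\n",
    "    \\d+   # one or more digits\n",
    "    -     # a literal hyphen\n",
    "    \\d+   # one or more digits\n",
    "''', re.X)\n",
    "\n",
    "print(ci.group(), plain, dotall is not None, verbose.match('2004-959').group())\n",
    "```\n",
    "\n",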
    "## Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "pattern = re.compile(r'\\d+')                    # matches one or more digits\n",
    "m = pattern.match('one12twothree34four')        # tries the start of the string: no match\n",
    "print(m)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "\n",
    "m = pattern.match('one12twothree34four', 2, 10) # starts at the 'e' (index 2): no match\n",
    "print(m)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(3, 5), match='12'>\n"
     ]
    }
   ],
   "source": [
    "\n",
    "m = pattern.match('one12twothree34four', 3, 10) # starts at the '1' (index 3): matches\n",
    "print(m)                                        # prints a Match object\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n",
      "3\n",
      "5\n",
      "(3, 5)\n"
     ]
    }
   ],
   "source": [
    "print(m.group(0))   # the 0 may be omitted\n",
    "\n",
    "print(m.start(0))   # the 0 may be omitted\n",
    "\n",
    "print(m.end(0))     # the 0 may be omitted\n",
    "\n",
    "print(m.span(0))    # the 0 may be omitted\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
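    "## Walkthrough\n",
    "\n",
    "As shown above, a successful match returns a Match object, where:\n",
    "\n",
    "+ `group([group1, …])` returns the string matched by one or more groups; to get the whole matched substring, call `group()` or `group(0)`;\n",
    "+ `start([group])` returns the start position, within the whole string, of the substring matched by the group (the index of its first character); the argument defaults to 0;\n",
    "+ `end([group])` returns the end position, within the whole string, of the substring matched by the group (the index of its last character plus 1); the argument defaults to 0;\n",
    "+ `span([group])` returns `(start(group), end(group))`."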
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Email regex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "541397597@qq.com\n",
      "542951435@qq.com\n",
      "491955467@qq.com\n",
      "1273972207@qq.com\n",
      "1726766047@qq.com\n",
      "1060959464@qq.com\n",
      "532872993@qq.com\n",
      "532872993@qq.com\n",
      "1198384227@qq.com\n",
      "2237512508@qq.com\n",
      "376941413@qq.com\n",
      "376941413@qq.com\n",
      "465012300@qq.com\n",
      "1628481353@qq.com\n",
      "965300698@qq.com\n",
      "1556240636@qq.com\n",
      "1106997643@qq.com\n",
      "1120057735@qq.com.\n",
      "1528519246@qq.com\n",
      "1340877544@qq.com\n",
      "1246321320@qq.com\n",
      "1440828872@qq.com\n",
      "2314999678@qq.com\n",
      "2314999678@qq.com\n",
      "2314999678@qq.com\n",
      "2314999678@qq.com\n",
      "1091424036@qq.com\n",
      "301056188@qq.com\n",
      "1149441496@qq.com\n",
      "541397597@qq.com\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "import requests\n",
    "\n",
    "regex = r\"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)\"\n",
    "# This variant rejects a domain label containing the letter q, which filters out qq.com addresses\n",
    "# regex = r\"([a-zA-Z0-9_.+-]+@[a-pr-zA-PR-Z0-9-]+\\.[a-zA-Z0-9-.]+)\"\n",
    "# For privacy, \"XXXXXXXXXXXXXX\" was used\n",
    "url = 'http://tieba.baidu.com/p/5527502385?pn=2'\n",
    "html = requests.get(url).text\n",
    "# print(html)\n",
    "emails = re.findall(regex, html)\n",
    "# Print at most the first 999 addresses found\n",
    "for email in emails[:999]:\n",
    "    print(email)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
